Text Mining Kickstarter Projects

Overview

Kickstarter is an American public-benefit corporation based in Brooklyn, New York, that maintains a global crowd funding platform focused on creativity. The company’s stated mission is to “help bring creative projects to life”.

Kickstarter has reportedly received almost $6 billion in pledges from 20 million backers to fund more than 200,000 creative projects, such as films, music, stage shows, comics, journalism, video games, technology and food-related projects.

For this assignment, I am asking you to analyze the descriptions of kickstarter projects to identify commonalities of successful (and unsuccessful projects) using the text mining techniques we covered in the past two lectures.

Data

The dataset for this assignment is taken from webroboto.io ‘s repository. They developed a scrapper robot that crawls all Kickstarter projects monthly since 2009. I noticed that the most recent crawls appear to be incomplete, so we will take data from the last complete crawl on 2021-05-17.

To simplify your task, I have downloaded the files and partially cleaned the scraped data. In particular, I converted several JSON columns, corrected some obvious data issues, and removed some variables that are not of interest (or missing frequently), and removed some duplicated project entries. I have also subsetted the data to only contain projects with locations set to the United States (to have only English language and USD denominated projects). Some data issues surely remain, so please adjust as you find it necessary to complete the analysis.

The data is contained in the file kickstarter_projects_2021_05.csv and contains about 131k projects and about 20 variables.

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidymodels)
## Registered S3 method overwritten by 'tune':
##   method                   from   
##   required_pkgs.model_spec parsnip
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.4 ──
## ✓ broom        0.7.9      ✓ rsample      0.1.1 
## ✓ dials        0.0.10     ✓ tibble       3.1.6 
## ✓ ggplot2      3.3.5      ✓ tidyr        1.2.0 
## ✓ infer        1.0.0      ✓ tune         0.1.6 
## ✓ modeldata    0.1.1      ✓ workflows    0.2.4 
## ✓ parsnip      0.1.7      ✓ workflowsets 0.1.0 
## ✓ purrr        0.3.4      ✓ yardstick    0.0.8 
## ✓ recipes      0.1.17
## Warning: package 'tidyr' was built under R version 4.1.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x purrr::discard()  masks scales::discard()
## x dplyr::filter()   masks stats::filter()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ stringr 1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x scales::col_factor() masks readr::col_factor()
## x purrr::discard()     masks scales::discard()
## x dplyr::filter()      masks stats::filter()
## x stringr::fixed()     masks recipes::fixed()
## x dplyr::lag()         masks stats::lag()
## x yardstick::spec()    masks readr::spec()
## load dataset 
setwd('/Users/nikkigerjarusak/Documents/GitHub/assignment-3---kickstarter-nikkigsak/')
df <- read_csv("kickstarter_projects_2021-05.csv")
## Rows: 130768 Columns: 25
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (12): blurb, country, country_displayable_name, currency, name, slug, s...
## dbl   (6): backers_count, converted_pledged_amount, goal, id, pledged, usd_e...
## lgl   (3): is_starrable, spotlight, staff_pick
## date  (4): created_at, deadline, launched_at, state_changed_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Tasks for the Assignment

1. Identifying Successful Projects

a) Success by Category

There are several ways to identify success of a project:
- State (state): Whether a campaign was successful or not.
- Pledged Amount (pledged)
- Achievement Ratio: The variable achievement_ratio is calculating the percentage of the original monetary goal reached by the actual amount pledged (that is pledged\goal *100).
- Number of backers (backers_count)
- How quickly the goal was reached (difference between launched_at and state_changed_at) for those campaigns that were successful.

Use two of these measures to visually summarize which categories were most successful in attracting funding on kickstarter. Briefly summarize your findings.

cat_1 <- df %>%
  group_by(top_category) %>% 
  summarize(total_pledged = sum(pledged))

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
success1 <- ggplot(cat_1, aes(x = top_category, y = total_pledged, fill = top_category)) +
  geom_bar(position = "dodge", stat = "identity") +  theme(axis.text.x = element_text(angle = 45)) +
  labs (x = "Categories", y = "Amount Pledged",
        title = "Amount Pledged Per Category")
ggplotly(success1)

In the plot above, we can see that the categories that get the largest pledged amount are by far technology, followed by film & video and games.

cat_2 <- df %>%
  group_by(top_category) %>% 
  summarize(total_backers = sum(backers_count))

success2 <- ggplot(cat_2, aes(x = top_category, y = total_backers, fill = top_category)) +
  geom_bar(position = "dodge", stat = "identity") +  theme(axis.text.x = element_text(angle = 45)) +
  labs (x = "Categories", y = "Number of Backers",
        title = "Backers Per Category")
ggplotly(success2)

In measuring success by the number of backers, technology is also the category with the most. However, the discrepancy between technology and the following two categories games and film & video is not as large as when measuring success by amount pledged.

BONUS ONLY: b) Success by Location

Now, use the location information to calculate the total number of successful projects by state (if you are ambitious, normalize by population). Also, identify the Top 50 “innovative” cities in the U.S. (by whatever measure you find plausible). Provide a leaflet map showing the most innovative states and cities in the U.S. on a single map based on these information.

locat_1 <- df %>%
  group_by(location_state) %>%
  summarize(total_pledged = sum(pledged)) %>%
  arrange(desc(total_pledged)) 

top_cities <- df %>%
  group_by(location_town) %>%
  summarize(total_pledged = sum(pledged)) %>%
  arrange(desc(total_pledged)) %>%
  slice(1:50)

2. Writing your success story

Each project contains a blurb – a short description of the project. While not the full description of the project, the short headline is arguably important for inducing interest in the project (and ultimately popularity and success). Let’s analyze the text.

a) Cleaning the Text and Word Cloud

To reduce the time for analysis, select the 1000 most successful projects and a sample of 1000 unsuccessful projects (by a metric of your choice). Use the cleaning functions introduced in lecture (or write your own in addition) to remove unnecessary words (stop words), syntax, punctuation, numbers, white space etc. Note, that many projects use their own unique brand names in upper cases, so try to remove these fully capitalized words as well (since we are aiming to identify common words across descriptions). Create a document-term-matrix.

## df$pledged <- as.integer(df$pledged)
##drop_na(df, pledged)

## select top 1000 by amount pledged
top_1000 <- df %>% 
  filter(state=='successful') %>%
  arrange(desc(pledged)) %>%
  ungroup() %>%
  slice(1:1000)

## select bottom 1000 by amount pledged
bottom_1000 <- df %>% 
  filter(state=='failed') %>%
  arrange(pledged) %>%
  ungroup() %>%
  slice(1:1000)

## merge to create df
## data <- rbind(top_1000, bottom_1000)
## top 1000 successful projects
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
top_text <- top_1000$blurb
corp <- Corpus(VectorSource(top_text))

## clean text by removing characters
text_fun <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
corp <- tm_map(corp, text_fun, "/")
## Warning in tm_map.SimpleCorpus(corp, text_fun, "/"): transformation drops
## documents
corp <- tm_map(corp, text_fun, "@")
## Warning in tm_map.SimpleCorpus(corp, text_fun, "@"): transformation drops
## documents
corp <- tm_map(corp, text_fun, "\\|")
## Warning in tm_map.SimpleCorpus(corp, text_fun, "\\|"): transformation drops
## documents
## lowercase
corp <- tm_map(corp, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(corp, content_transformer(tolower)):
## transformation drops documents
## remove numbers
corp <- tm_map(corp, removeNumbers)
## Warning in tm_map.SimpleCorpus(corp, removeNumbers): transformation drops
## documents
## remove stopwords
corp <- tm_map(corp, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(corp, removeWords, stopwords("english")):
## transformation drops documents
## remove punctuation
corp <- tm_map(corp, removePunctuation)
## Warning in tm_map.SimpleCorpus(corp, removePunctuation): transformation drops
## documents
## remove extra white spaces
corp <- tm_map(corp, stripWhitespace)
## Warning in tm_map.SimpleCorpus(corp, stripWhitespace): transformation drops
## documents
## stemming
corp <- tm_map(corp, stemDocument)
## Warning in tm_map.SimpleCorpus(corp, stemDocument): transformation drops
## documents
## lemmatize
library(textstem)
## Loading required package: koRpus.lang.en
## Loading required package: koRpus
## Loading required package: sylly
## For information on available language packages for 'koRpus', run
## 
##   available.koRpus.lang()
## 
## and see ?install.koRpus.lang()
## 
## Attaching package: 'koRpus'
## The following object is masked from 'package:tm':
## 
##     readTagged
## The following object is masked from 'package:readr':
## 
##     tokenize
clean_corp <- lemmatize_words(corp)
library(tidytext)
## successful projects document-term-matrix

## build matrix
top_dtm <- DocumentTermMatrix(clean_corp) 
top_td <- tidy(top_dtm)


top_lemma <- top_td %>%
  mutate(lemma = lemmatize_words(term))
top_lemma <- top_lemma %>%
  group_by(lemma) %>% 
  summarize(count = sum(count))
## bottom 1000 failed projects
bottom_text <- bottom_1000$blurb
corp_1 <- Corpus(VectorSource(bottom_text))

## clean text by removing characters
text_fun <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
corp_1 <- tm_map(corp_1, text_fun, "/")
## Warning in tm_map.SimpleCorpus(corp_1, text_fun, "/"): transformation drops
## documents
corp_1 <- tm_map(corp_1, text_fun, "@")
## Warning in tm_map.SimpleCorpus(corp_1, text_fun, "@"): transformation drops
## documents
corp_1 <- tm_map(corp_1, text_fun, "\\|")
## Warning in tm_map.SimpleCorpus(corp_1, text_fun, "\\|"): transformation drops
## documents
## lowercase
corp_1 <- tm_map(corp_1, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(corp_1, content_transformer(tolower)):
## transformation drops documents
## remove numbers
corp_1 <- tm_map(corp_1, removeNumbers)
## Warning in tm_map.SimpleCorpus(corp_1, removeNumbers): transformation drops
## documents
## remove stopwords
corp_1 <- tm_map(corp_1, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(corp_1, removeWords, stopwords("english")):
## transformation drops documents
## remove punctuation
corp_1 <- tm_map(corp_1, removePunctuation)
## Warning in tm_map.SimpleCorpus(corp_1, removePunctuation): transformation drops
## documents
## remove extra white spaces
corp_1 <- tm_map(corp_1, stripWhitespace)
## Warning in tm_map.SimpleCorpus(corp_1, stripWhitespace): transformation drops
## documents
## stemming
corp_1 <- tm_map(corp_1, stemDocument)
## Warning in tm_map.SimpleCorpus(corp_1, stemDocument): transformation drops
## documents
## lemmatize
clean_corp1 <- lemmatize_words(corp_1)
## failed projects document-term-matrix

## build matrix
bottom_dtm <- DocumentTermMatrix(clean_corp1) 
bottom_td <- tidy(bottom_dtm)

bottom_lemma <- bottom_td %>%
  mutate(lemma = lemmatize_words(term))
bottom_lemma <- bottom_lemma %>%
  group_by(lemma) %>% 
  summarize(count = sum(count))

Provide a word cloud of the most frequent or important words (your choice which frequency measure you choose) among the most successful projects.

library(wordcloud)
## Loading required package: RColorBrewer
set.seed(1234)
wordcloud(words = top_lemma$lemma, freq = top_lemma$count, min.freq = 7,
          max.words=100, random.order=FALSE, 
          colors=brewer.pal(9, "Dark2"))
## Warning in brewer.pal(9, "Dark2"): n too large, allowed maximum for palette Dark2 is 8
## Returning the palette you asked for with that many colors
## Warning in wordcloud(words = top_lemma$lemma, freq = top_lemma$count, min.freq =
## 7, : screen could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = top_lemma$lemma, freq = top_lemma$count, min.freq =
## 7, : compact could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = top_lemma$lemma, freq = top_lemma$count, min.freq =
## 7, : qualiti could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = top_lemma$lemma, freq = top_lemma$count, min.freq =
## 7, : track could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = top_lemma$lemma, freq = top_lemma$count, min.freq =
## 7, : travel could not be fit on page. It will not be plotted.

b) Success in words

Provide a pyramid plot to show how the words between successful and unsuccessful projects differ in frequency. A selection of 10 - 20 top words is sufficient here.

top_tdm1 <- TermDocumentMatrix(clean_corp) 
top_mat <- as.matrix(top_tdm1)
top_tdm_v <- sort(rowSums(top_mat), decreasing = TRUE)
top_tdm_d <- data.frame(word = names(top_tdm_v), freq = top_tdm_v)

bottom_tdm1 <-TermDocumentMatrix(clean_corp1)
bottom_mat <- as.matrix(bottom_tdm1)
bottom_tdm_v <- sort(rowSums(bottom_mat), decreasing = TRUE)
bottom_tdm_d <- data.frame(word = names(bottom_tdm_v), freq = bottom_tdm_v)

## create merged data frame
data = merge(x = top_tdm_d, y = bottom_tdm_d, by = 'word')
colnames(data)[colnames(data) %in% c("freq.x", "freq.y")] <- c("successful", "unsuccessful")
colnames(data) 
## [1] "word"         "successful"   "unsuccessful"
data$final <- data$successful + data$unsuccessful

## top 20
data_pyramid<-data %>% 
  arrange(desc(final)) %>%
  mutate(rank=row_number()) %>%
  filter(rank<=20)
## pyramid plot
library(plotrix)
## 
## Attaching package: 'plotrix'
## The following object is masked from 'package:scales':
## 
##     rescale
pyr_plot <- pyramid.plot(data_pyramid$successful, data_pyramid$unsuccessful,
                  labels = data_pyramid$word,
             gap = 10,
             top.labels = c("Successful Words", " ", "Unsuccessful Words"),
             main = "Top 20 Words in Successful and Unsuccessful projects",
             laxlab = NULL,
             raxlab = NULL,
             unit = NULL,
             labelcex=0.8)

## 126 126

c) Simplicity as a virtue

These blurbs are short in length (max. 150 characters) but let’s see whether brevity and simplicity still matters. Calculate a readability measure (Flesh Reading Ease, Flesh Kincaid or any other comparable measure) for the texts. Visualize the relationship between the readability measure and one of the measures of success. Briefly comment on your finding.

require(quanteda)
## Loading required package: quanteda
## Warning: package 'quanteda' was built under R version 4.1.2
## Package version: 3.2.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following objects are masked from 'package:koRpus':
## 
##     tokens, types
## The following object is masked from 'package:tm':
## 
##     stopwords
## The following objects are masked from 'package:NLP':
## 
##     meta, meta<-
require(dplyr)
require(quanteda.textstats)
## Loading required package: quanteda.textstats
## successful projects
top_corpus <- corpus(top_text)
FRE_top <- textstat_readability(top_corpus,
              measure=c('Flesch.Kincaid'))
FRE_top
##      document Flesch.Kincaid
## 1       text1     16.7223529
## 2       text2      8.3700000
## 3       text3      9.5500000
## 4       text4      8.3700000
## 5       text5     12.8285714
## 6       text6     10.4866667
## 7       text7     13.8600000
## 8       text8      6.0100000
## 9       text9      7.0478571
## 10     text10      5.8254348
## 11     text11      2.8955000
## 12     text12     12.8285714
## 13     text13     11.6800000
## 14     text14     12.3147368
## 15     text15     13.2517647
## 16     text16      8.1716667
## 17     text17      7.2095238
## 18     text18      8.5415385
## 19     text19     12.8285714
## 20     text20      8.7296154
## 21     text21     10.5809524
## 22     text22      6.7900000
## 23     text23      3.2602174
## 24     text24     10.3104348
## 25     text25     13.4500000
## 26     text26      6.2755556
## 27     text27     12.5576471
## 28     text28      7.3676316
## 29     text29      4.8610000
## 30     text30      5.8640909
## 31     text31     11.9616667
## 32     text32     12.9450000
## 33     text33      2.4700000
## 34     text34      4.8400000
## 35     text35     10.5809524
## 36     text36      7.6125000
## 37     text37      6.9363636
## 38     text38      6.6000000
## 39     text39     13.9523810
## 40     text40     13.5568421
## 41     text41      6.9368182
## 42     text42      8.0090909
## 43     text43     14.5142857
## 44     text44     13.5125000
## 45     text45      1.8200000
## 46     text46      9.6550000
## 47     text47     10.5000000
## 48     text48      8.3500000
## 49     text49      8.7800000
## 50     text50     14.9875000
## 51     text51      7.1900000
## 52     text52      7.3645652
## 53     text53     10.9350000
## 54     text54      2.7792593
## 55     text55      7.1602941
## 56     text56      4.7909091
## 57     text57     12.9357895
## 58     text58      9.3616667
## 59     text59     15.0761905
## 60     text60      1.8983333
## 61     text61      4.7913636
## 62     text62     13.9523810
## 63     text63     16.9900000
## 64     text64      4.8300000
## 65     text65      6.7800000
## 66     text66      7.1900000
## 67     text67     12.6050000
## 68     text68      8.1716667
## 69     text69     13.4500000
## 70     text70     10.7233333
## 71     text71      5.3646154
## 72     text72     12.3626087
## 73     text73      7.1400000
## 74     text74      3.8260000
## 75     text75      7.4727273
## 76     text76     10.7233333
## 77     text77     11.3000000
## 78     text78     11.3365217
## 79     text79     10.5000000
## 80     text80      5.3150000
## 81     text81     13.2955556
## 82     text82      3.6500000
## 83     text83     14.7989474
## 84     text84     13.4866667
## 85     text85     13.8600000
## 86     text86     11.5000000
## 87     text87     12.5576471
## 88     text88      4.0903704
## 89     text89      7.6518182
## 90     text90      6.1326316
## 91     text91      7.9957895
## 92     text92     13.4366667
## 93     text93      5.8066667
## 94     text94     10.4411111
## 95     text95     10.7233333
## 96     text96      6.3384783
## 97     text97      5.3621429
## 98     text98      6.4000000
## 99     text99      7.3645652
## 100   text100     12.8600000
## 101   text101      7.9886842
## 102   text102     11.6800000
## 103   text103      4.3890000
## 104   text104     11.8495652
## 105   text105      8.4128571
## 106   text106     15.4200000
## 107   text107     15.2200000
## 108   text108      7.6097619
## 109   text109      7.4727273
## 110   text110      6.2983333
## 111   text111      4.8434783
## 112   text112     12.3626087
## 113   text113     12.5083333
## 114   text114      9.8589474
## 115   text115     13.3904762
## 116   text116     11.6936842
## 117   text117     13.0733333
## 118   text118     13.2517647
## 119   text119      8.1800000
## 120   text120     13.5120000
## 121   text121     11.0900000
## 122   text122      8.5454545
## 123   text123     11.8675000
## 124   text124     13.4500000
## 125   text125      8.6097368
## 126   text126      7.6338462
## 127   text127     14.6300000
## 128   text128      9.6550000
## 129   text129     12.8354545
## 130   text130      8.7335714
## 131   text131      9.2843478
## 132   text132     13.5568421
## 133   text133     14.9875000
## 134   text134     11.0726316
## 135   text135     11.2272727
## 136   text136     14.6300000
## 137   text137      3.8972727
## 138   text138     13.7188889
## 139   text139     10.0985714
## 140   text140      7.6097619
## 141   text141     11.7842857
## 142   text142      7.7800000
## 143   text143     13.5568421
## 144   text144      5.0779412
## 145   text145     11.8495652
## 146   text146     12.0375000
## 147   text147      6.4859524
## 148   text148      9.1300000
## 149   text149     10.0190476
## 150   text150     13.5568421
## 151   text151      4.3316667
## 152   text152      7.6097619
## 153   text153      9.8573810
## 154   text154     10.1483333
## 155   text155      9.9666667
## 156   text156      7.9886842
## 157   text157      8.9600000
## 158   text158      2.3422222
## 159   text159     11.5000000
## 160   text160      9.8250000
## 161   text161      6.7607895
## 162   text162      7.1602941
## 163   text163      7.4086957
## 164   text164     11.4700000
## 165   text165     12.2666667
## 166   text166      9.2307895
## 167   text167      5.3330000
## 168   text168     13.4366667
## 169   text169     14.4147826
## 170   text170     11.5000000
## 171   text171      8.3533333
## 172   text172      7.6988235
## 173   text173     15.2200000
## 174   text174      9.8305263
## 175   text175     17.0300000
## 176   text176     13.0633333
## 177   text177      4.8233333
## 178   text178     17.9375000
## 179   text179     12.8600000
## 180   text180     12.8585185
## 181   text181      3.7550000
## 182   text182      8.3333333
## 183   text183     10.4192857
## 184   text184     15.3341176
## 185   text185      6.4661765
## 186   text186     11.2646154
## 187   text187     12.2990909
## 188   text188     12.2700000
## 189   text189      5.8522222
## 190   text190     12.8354545
## 191   text191     15.4500000
## 192   text192     12.6900000
## 193   text193      5.8453704
## 194   text194     15.4500000
## 195   text195      9.2307895
## 196   text196      8.3533333
## 197   text197      1.6073529
## 198   text198      7.0478571
## 199   text199      6.9311111
## 200   text200      8.5750000
## 201   text201     10.9558696
## 202   text202     10.4866667
## 203   text203      7.4718182
## 204   text204      8.5415385
## 205   text205      7.7800000
## 206   text206      9.0542857
## 207   text207      8.3929412
## 208   text208      7.9886842
## 209   text209      9.4571429
## 210   text210     14.7989474
## 211   text211      6.8515217
## 212   text212      9.2307895
## 213   text213      5.8066667
## 214   text214     11.7047619
## 215   text215     10.4866667
## 216   text216      6.1326316
## 217   text217     13.5568421
## 218   text218      7.3685714
## 219   text219     10.5809524
## 220   text220     13.3718182
## 221   text221      9.5500000
## 222   text222     17.2831579
## 223   text223     12.6271429
## 224   text224     13.9523810
## 225   text225     14.1422222
## 226   text226      7.7733333
## 227   text227     11.1428571
## 228   text228      7.0478571
## 229   text229     10.7233333
## 230   text230     10.6900000
## 231   text231      6.2150000
## 232   text232     12.8600000
## 233   text233     14.7989474
## 234   text234      6.7050000
## 235   text235      8.1716667
## 236   text236      7.3685714
## 237   text237      6.9363636
## 238   text238      9.4167391
## 239   text239     11.9100000
## 240   text240      9.2954762
## 241   text241     12.8600000
## 242   text242     10.5809524
## 243   text243      6.4000000
## 244   text244      8.1400000
## 245   text245     11.6800000
## 246   text246      8.3833333
## 247   text247      8.5415385
## 248   text248     12.3147368
## 249   text249     11.2646154
## 250   text250      6.3384783
## 251   text251     10.1545455
## 252   text252     12.2700000
## 253   text253     12.3147368
## 254   text254     13.0400000
## 255   text255     12.9450000
## 256   text256      5.3272727
## 257   text257      9.6181818
## 258   text258     14.4445455
## 259   text259     11.7627273
## 260   text260     10.8234783
## 261   text261      8.1716667
## 262   text262     17.4164706
## 263   text263      7.3685714
## 264   text264     10.2080000
## 265   text265      3.6533333
## 266   text266     11.1520000
## 267   text267      8.7300000
## 268   text268     13.3904762
## 269   text269      4.0030769
## 270   text270     11.4700000
## 271   text271     13.9523810
## 272   text272      6.6000000
## 273   text273      7.4425000
## 274   text274      6.7900000
## 275   text275      5.4200000
## 276   text276      6.7900000
## 277   text277      8.9600000
## 278   text278      7.3676316
## 279   text279     10.4346154
## 280   text280      7.1633333
## 281   text281      6.3384783
## 282   text282      2.4616848
## 283   text283      9.0818182
## 284   text284      7.3818421
## 285   text285     10.1536364
## 286   text286      8.8977778
## 287   text287     13.6733333
## 288   text288      8.1716667
## 289   text289     11.1520000
## 290   text290     16.6621053
## 291   text291      6.4000000
## 292   text292     10.7133333
## 293   text293     10.4866667
## 294   text294     10.9783333
## 295   text295      6.3105882
## 296   text296     10.3104348
## 297   text297      3.9971429
## 298   text298     10.0985714
## 299   text299     15.0761905
## 300   text300     11.0966667
## 301   text301     13.4500000
## 302   text302     11.7657143
## 303   text303     12.8285714
## 304   text304      5.8184615
## 305   text305      5.8522222
## 306   text306     12.2700000
## 307   text307      4.2623684
## 308   text308     14.3128571
## 309   text309     10.3104348
## 310   text310     11.0966667
## 311   text311      8.7335714
## 312   text312      3.7550000
## 313   text313      8.2114286
## 314   text314     12.7750000
## 315   text315      7.3645652
## 316   text316     11.5828571
## 317   text317      6.6200000
## 318   text318     10.4835526
## 319   text319      5.3621429
## 320   text320     15.0300000
## 321   text321      7.0047059
## 322   text322     14.1778947
## 323   text323      8.0090909
## 324   text324      4.9644444
## 325   text325     10.9811905
## 326   text326     12.2666667
## 327   text327     14.0400000
## 328   text328     14.9278261
## 329   text329     16.6621053
## 330   text330      4.9644444
## 331   text331      8.2582609
## 332   text332     11.0726316
## 333   text333      7.6988235
## 334   text334     12.5576471
## 335   text335      8.0090909
## 336   text336      1.3133333
## 337   text337      6.0100000
## 338   text338      6.3384783
## 339   text339     11.7627273
## 340   text340     11.8495652
## 341   text341      6.7050000
## 342   text342      8.1716667
## 343   text343      6.7252632
## 344   text344     12.2866667
## 345   text345      8.5415385
## 346   text346     10.3569231
## 347   text347      6.0100000
## 348   text348      4.3316667
## 349   text349     15.4033333
## 350   text350     12.8285714
## 351   text351      5.8636364
## 352   text352      7.5866667
## 353   text353      8.1400000
## 354   text354      3.9170000
## 355   text355      5.0779412
## 356   text356      7.8544118
## 357   text357     12.0191176
## 358   text358     13.3886957
## 359   text359     11.5000000
## 360   text360      8.1716667
## 361   text361     13.0800000
## 362   text362     12.9450000
## 363   text363      8.7566667
## 364   text364      9.7973913
## 365   text365     11.3250000
## 366   text366     14.9875000
## 367   text367     11.9616667
## 368   text368      9.5500000
## 369   text369      8.5454545
## 370   text370      8.0090909
## 371   text371      9.8250000
## 372   text372      3.0175000
## 373   text373      5.4333333
## 374   text374      5.2466667
## 375   text375      2.2800000
## 376   text376      9.2954762
## 377   text377      8.9600000
## 378   text378      9.0730769
## 379   text379      8.9036957
## 380   text380     10.6900000
## 381   text381     11.4700000
## 382   text382      8.7566667
## 383   text383      4.9644444
## 384   text384     12.9357895
## 385   text385      6.4000000
## 386   text386      6.2770000
## 387   text387     12.3626087
## 388   text388      3.6601852
## 389   text389      7.4727273
## 390   text390     11.1520000
## 391   text391      9.0542857
## 392   text392     11.7047619
## 393   text393      4.0763636
## 394   text394      6.5790909
## 395   text395      5.5238095
## 396   text396     12.8354545
## 397   text397     11.5000000
## 398   text398      7.1900000
## 399   text399     13.7188889
## 400   text400      3.6700000
## 401   text401      3.6413158
## 402   text402     15.4333333
## 403   text403     10.2088889
## 404   text404      9.6550000
## 405   text405     14.0400000
## 406   text406     17.4164706
## 407   text407      6.8515217
## 408   text408      6.7271429
## 409   text409     12.5680000
## 410   text410      6.8750000
## 411   text411      7.6600000
## 412   text412     13.3886957
## 413   text413      1.7888158
## 414   text414     11.3200000
## 415   text415      7.1900000
## 416   text416     11.0900000
## 417   text417      5.9022222
## 418   text418      7.6000000
## 419   text419     10.8644444
## 420   text420     16.3411111
## 421   text421     13.2165517
## 422   text422     16.0282353
## 423   text423     15.8030769
## 424   text424      8.0090909
## 425   text425     16.9966667
## 426   text426      8.9175000
## 427   text427      6.2723077
## 428   text428     12.8600000
## 429   text429      8.1716667
## 430   text430      9.7400000
## 431   text431     14.2700000
## 432   text432     10.9783333
## 433   text433     13.5568421
## 434   text434      5.2300000
## 435   text435      6.4000000
## 436   text436     13.5120000
## 437   text437      7.6600000
## 438   text438     12.3147368
## 439   text439      8.3833333
## 440   text440     11.1694118
## 441   text441     13.9458824
## 442   text442     10.1536364
## 443   text443     13.7188889
## 444   text444      9.2426471
## 445   text445      6.4859524
## 446   text446     13.3904762
## 447   text447     11.7047619
## 448   text448     14.6300000
## 449   text449      7.0478571
## 450   text450     13.9523810
## 451   text451     10.2088889
## 452   text452     11.9100000
## 453   text453      8.4128571
## 454   text454     10.4866667
## 455   text455      6.2983333
## 456   text456     10.4866667
## 457   text457     10.6900000
## 458   text458     12.3147368
## 459   text459      7.9886842
## 460   text460     10.1536364
## 461   text461     10.0190476
## 462   text462     -0.2834259
## 463   text463     10.0190476
## 464   text464      5.9675000
## 465   text465      9.9600000
## 466   text466      8.2792647
## 467   text467      8.0090909
## 468   text468     11.0900000
## 469   text469      7.8991667
## 470   text470      6.3500000
## 471   text471     10.5809524
## 472   text472      4.8233333
## 473   text473      6.4000000
## 474   text474      6.3105882
## 475   text475      9.0542857
## 476   text476     10.6900000
## 477   text477      4.7909091
## 478   text478     12.2666667
## 479   text479      9.5500000
## 480   text480      5.8636364
## 481   text481     11.7047619
## 482   text482     17.7933333
## 483   text483      9.7973913
## 484   text484     12.8600000
## 485   text485      4.0000000
## 486   text486     10.4752941
## 487   text487     14.5142857
## 488   text488     11.7047619
## 489   text489      5.3621429
## 490   text490     14.2700000
## 491   text491      6.6000000
## 492   text492     12.2666667
## 493   text493     11.0726316
## 494   text494     12.8285714
## 495   text495      9.2954762
## 496   text496     14.3744444
## 497   text497      3.7080000
## 498   text498      8.3833333
## 499   text499      6.9363636
## 500   text500      5.3565217
## 501   text501      9.0116667
## 502   text502     10.8884615
## 503   text503      3.6484211
## 504   text504      6.1255263
## 505   text505      9.1400000
## 506   text506      8.3533333
## 507   text507      5.6200000
## 508   text508     10.3500000
## 509   text509     12.4533333
## 510   text510      8.9600000
## 511   text511     16.6233333
## 512   text512     11.0726316
## 513   text513     18.8047059
## 514   text514      6.2983333
## 515   text515      9.8589474
## 516   text516      7.8776087
## 517   text517     14.3744444
## 518   text518      9.0818182
## 519   text519      5.8636364
## 520   text520      8.0090909
## 521   text521      8.1716667
## 522   text522     15.3341176
## 523   text523      8.4128571
## 524   text524     12.5576471
## 525   text525      1.1642105
## 526   text526     13.9523810
## 527   text527      7.1154545
## 528   text528      5.2300000
## 529   text529      7.6338462
## 530   text530      7.6338462
## 531   text531      5.3565217
## 532   text532      8.9600000
## 533   text533     12.8285714
## 534   text534     10.7233333
## 535   text535      8.2422222
## 536   text536     15.4333333
## 537   text537     15.7250000
## 538   text538     12.4533333
## 539   text539      8.1800000
## 540   text540     12.8354545
## 541   text541      6.2755556
## 542   text542      9.9100000
## 543   text543      4.8233333
## 544   text544      5.3330000
## 545   text545     13.4500000
## 546   text546     19.1685714
## 547   text547      7.0047059
## 548   text548      8.0081818
## 549   text549      9.0818182
## 550   text550      9.0818182
## 551   text551     11.8675000
## 552   text552     14.6300000
## 553   text553     11.6936842
## 554   text554     10.2088889
## 555   text555     10.3500000
## 556   text556     19.9800000
## 557   text557     13.5568421
## 558   text558      8.9600000
## 559   text559     13.5568421
## 560   text560     -3.4000000
## 561   text561      2.8245455
## 562   text562      8.0097727
## 563   text563     13.3718182
## 564   text564     11.0966667
## 565   text565      7.4727273
## 566   text566     12.8600000
## 567   text567      6.3105882
## 568   text568      8.3929412
## 569   text569      8.3929412
## 570   text570      8.3929412
## 571   text571     10.4752941
## 572   text572     11.1428571
## 573   text573     11.3365217
## 574   text574      4.8400000
## 575   text575     14.0400000
## 576   text576     12.3000000
## 577   text577      7.4727273
## 578   text578      8.3533333
## 579   text579      7.6125000
## 580   text580     11.7047619
## 581   text581      6.0065385
## 582   text582      7.1633333
## 583   text583      5.8254348
## 584   text584     14.0400000
## 585   text585      8.3700000
## 586   text586      0.5166667
## 587   text587      5.3272727
## 588   text588      9.4492308
## 589   text589     13.3718182
## 590   text590     10.0190476
## 591   text591      7.8188889
## 592   text592     11.3365217
## 593   text593      6.8515217
## 594   text594      3.8972727
## 595   text595      7.7733333
## 596   text596     12.8354545
## 597   text597     14.7989474
## 598   text598     11.3442857
## 599   text599     16.5900000
## 600   text600     13.0733333
## 601   text601     11.8635294
## 602   text602      7.3747368
## 603   text603      8.6150000
## 604   text604      6.8750000
## 605   text605     11.3000000
## 606   text606     10.6900000
## 607   text607     10.6900000
## 608   text608     14.7989474
## 609   text609     10.1545455
## 610   text610     13.7188889
## 611   text611     11.1428571
## 612   text612     12.0087500
## 613   text613      3.7181818
## 614   text614      8.4744444
## 615   text615     13.0800000
## 616   text616      7.2816667
## 617   text617     14.7989474
## 618   text618      9.0870588
## 619   text619     11.0966667
## 620   text620     10.9414286
## 621   text621     11.1520000
## 622   text622      6.0065385
## 623   text623     12.5680000
## 624   text624     11.3365217
## 625   text625      6.7536842
## 626   text626     13.4500000
## 627   text627      4.9644444
## 628   text628      8.0090909
## 629   text629     12.9450000
## 630   text630     10.3500000
## 631   text631      5.8636364
## 632   text632      6.7536842
## 633   text633      9.2954762
## 634   text634      5.8636364
## 635   text635     11.5000000
## 636   text636     15.9985714
## 637   text637     18.6750000
## 638   text638      9.7811765
## 639   text639      9.4492308
## 640   text640      9.8091176
## 641   text641      7.4727273
## 642   text642      8.3333333
## 643   text643     11.0900000
## 644   text644      6.9363636
## 645   text645      6.0100000
## 646   text646     13.3904762
## 647   text647     12.8285714
## 648   text648      9.8573810
## 649   text649     15.0300000
## 650   text650      6.8515217
## 651   text651     11.0900000
## 652   text652      9.8250000
## 653   text653     11.5000000
## 654   text654      9.6550000
## 655   text655      4.9700000
## 656   text656     10.7300000
## 657   text657      8.7566667
## 658   text658     11.9100000
## 659   text659      9.5533333
## 660   text660      5.8636364
## 661   text661      9.5533333
## 662   text662      9.5700000
## 663   text663     14.1014706
## 664   text664      6.2133333
## 665   text665     12.2700000
## 666   text666      9.7811765
## 667   text667      6.0100000
## 668   text668      9.8305263
## 669   text669      7.1900000
## 670   text670     11.7047619
## 671   text671     10.9783333
## 672   text672      6.3384783
## 673   text673      7.5866667
## 674   text674      7.1900000
## 675   text675     15.6380952
## 676   text676     14.0400000
## 677   text677      9.6172727
## 678   text678      5.9240476
## 679   text679     11.1520000
## 680   text680     12.8285714
## 681   text681      4.4200000
## 682   text682      5.3621429
## 683   text683     11.5000000
## 684   text684      6.7050000
## 685   text685      7.7800000
## 686   text686     11.7627273
## 687   text687     -2.8800000
## 688   text688      1.7888158
## 689   text689     13.3886957
## 690   text690      9.6172727
## 691   text691     12.1723077
## 692   text692     11.3365217
## 693   text693      5.0779412
## 694   text694     11.7627273
## 695   text695     11.6800000
## 696   text696      4.9107692
## 697   text697      7.7800000
## 698   text698     11.2263636
## 699   text699     14.4445455
## 700   text700      9.2557143
## 701   text701     13.3718182
## 702   text702     14.7989474
## 703   text703      8.3700000
## 704   text704      9.0818182
## 705   text705     11.0900000
## 706   text706     10.5000000
## 707   text707     16.0410526
## 708   text708      7.1900000
## 709   text709      4.8987273
## 710   text710     12.8600000
## 711   text711      8.9600000
## 712   text712      6.2770000
## 713   text713     13.9523810
## 714   text714      4.7132759
## 715   text715      3.3200000
## 716   text716     13.6115385
## 717   text717      6.0100000
## 718   text718     10.7133333
## 719   text719      9.1300000
## 720   text720     12.2666667
## 721   text721      6.0427273
## 722   text722      8.9600000
## 723   text723     14.5142857
## 724   text724     14.1422222
## 725   text725      9.0875000
## 726   text726      8.4744444
## 727   text727     12.4533333
## 728   text728      7.4727273
## 729   text729      9.2954762
## 730   text730     10.7751724
## 731   text731      7.6000000
## 732   text732     16.3411111
## 733   text733     11.5000000
## 734   text734      8.3533333
## 735   text735      6.2755556
## 736   text736      9.1300000
## 737   text737      4.6520000
## 738   text738      7.7800000
## 739   text739     15.6855556
## 740   text740      8.9600000
## 741   text741      7.6988235
## 742   text742     11.7627273
## 743   text743     16.0410526
## 744   text744     10.5625000
## 745   text745      4.2863043
## 746   text746      5.8842857
## 747   text747      5.8184615
## 748   text748     12.8600000
## 749   text749     19.4988235
## 750   text750      7.6600000
## 751   text751     13.4366667
## 752   text752      4.3838235
## 753   text753      6.1255263
## 754   text754     10.2362963
## 755   text755      4.8987273
## 756   text756     15.6855556
## 757   text757     10.8234783
## 758   text758      4.9107692
## 759   text759     18.9900000
## 760   text760     10.3569231
## 761   text761      7.4727273
## 762   text762     12.8354545
## 763   text763      6.1375000
## 764   text764     15.0761905
## 765   text765      1.0822609
## 766   text766     13.9523810
## 767   text767      8.3333333
## 768   text768      4.7909091
## 769   text769     13.0733333
## 770   text770      2.8800000
## 771   text771     11.2263636
## 772   text772      0.5200000
## 773   text773      7.9673684
## 774   text774     10.4866667
## 775   text775     12.8600000
## 776   text776      4.4500000
## 777   text777     11.0900000
## 778   text778     13.9876923
## 779   text779     10.9783333
## 780   text780      6.0100000
## 781   text781      8.5415385
## 782   text782     13.6905882
## 783   text783      7.6125000
## 784   text784      6.7252632
## 785   text785     12.3147368
## 786   text786     -1.8400000
## 787   text787     13.9081818
## 788   text788      9.0542857
## 789   text789      6.4000000
## 790   text790      7.6097619
## 791   text791      6.9311111
## 792   text792      9.8660526
## 793   text793     10.5625000
## 794   text794      6.7900000
## 795   text795     11.2263636
## 796   text796     11.0900000
## 797   text797      6.7050000
## 798   text798     11.4700000
## 799   text799      8.5445455
## 800   text800      6.4661765
## 801   text801     15.5550000
## 802   text802      8.2650000
## 803   text803      9.1800000
## 804   text804      9.5500000
## 805   text805      8.5454545
## 806   text806      3.2207692
## 807   text807     14.1422222
## 808   text808      8.7713043
## 809   text809      5.8184615
## 810   text810     10.8884615
## 811   text811     14.2500000
## 812   text812      5.3330000
## 813   text813      6.8515217
## 814   text814      6.7800000
## 815   text815      8.0090909
## 816   text816     11.3365217
## 817   text817     16.3411111
## 818   text818      5.8636364
## 819   text819      8.1716667
## 820   text820     14.2700000
## 821   text821     13.3731818
## 822   text822     10.1536364
## 823   text823     11.7627273
## 824   text824     11.0557353
## 825   text825     11.0900000
## 826   text826      9.8250000
## 827   text827     15.8100000
## 828   text828     12.0375000
## 829   text829     10.4752941
## 830   text830     11.1520000
## 831   text831     14.0400000
## 832   text832     11.7047619
## 833   text833      7.1633333
## 834   text834     11.3000000
## 835   text835      9.6172727
## 836   text836      8.0090909
## 837   text837     13.4700000
## 838   text838      8.3906522
## 839   text839      3.7186364
## 840   text840      5.8066667
## 841   text841      9.5500000
## 842   text842     13.3886957
## 843   text843     11.2272727
## 844   text844      6.2755556
## 845   text845     11.6800000
## 846   text846     10.6308824
## 847   text847     12.8285714
## 848   text848      7.6097619
## 849   text849     13.5568421
## 850   text850     11.2646154
## 851   text851     13.4866667
## 852   text852     12.3147368
## 853   text853     12.7750000
## 854   text854      4.6200000
## 855   text855      8.7713043
## 856   text856     11.9616667
## 857   text857      9.0818182
## 858   text858     14.5142857
## 859   text859     12.3626087
## 860   text860      4.7909091
## 861   text861      9.0822727
## 862   text862     11.1428571
## 863   text863     13.5568421
## 864   text864      7.7714286
## 865   text865     11.0966667
## 866   text866      5.8254348
## 867   text867      8.5454545
## 868   text868      6.0692308
## 869   text869     14.7989474
## 870   text870      2.8900000
## 871   text871     16.0282353
## 872   text872     11.7627273
## 873   text873      9.0875000
## 874   text874     15.9911538
## 875   text875      4.3316667
## 876   text876      8.3700000
## 877   text877     11.8635294
## 878   text878     20.5900000
## 879   text879     10.1536364
## 880   text880      3.3609091
## 881   text881     14.6800000
## 882   text882     14.6300000
## 883   text883     10.5000000
## 884   text884      9.6550000
## 885   text885      2.3422222
## 886   text886      7.3645652
## 887   text887      6.3384783
## 888   text888      6.4661765
## 889   text889     11.4700000
## 890   text890     10.6800000
## 891   text891      4.4925000
## 892   text892     14.1778947
## 893   text893      8.7335714
## 894   text894     12.2666667
## 895   text895     12.8311111
## 896   text896      8.1716667
## 897   text897     11.8495652
## 898   text898      5.0988462
## 899   text899     13.9081818
## 900   text900      4.7993478
## 901   text901      3.6500000
## 902   text902     10.4411111
## 903   text903      5.3277273
## 904   text904     11.3365217
## 905   text905      4.7909091
## 906   text906      7.7800000
## 907   text907      4.8400000
## 908   text908     11.6240000
## 909   text909     10.0900000
## 910   text910      9.0875000
## 911   text911      7.5500000
## 912   text912      5.8400000
## 913   text913      6.4000000
## 914   text914      5.8636364
## 915   text915      3.6413158
## 916   text916     11.1694118
## 917   text917     10.3500000
## 918   text918      5.2300000
## 919   text919     10.1400000
## 920   text920      6.2770000
## 921   text921      8.0283333
## 922   text922      3.7550000
## 923   text923     10.0985714
## 924   text924      9.6181818
## 925   text925      5.3123913
## 926   text926     15.6855556
## 927   text927      7.8188889
## 928   text928      3.8173913
## 929   text929      7.7733333
## 930   text930      6.5790909
## 931   text931     14.3128571
## 932   text932     13.3904762
## 933   text933      9.9100000
## 934   text934      5.3272727
## 935   text935      9.7650000
## 936   text936     16.6300000
## 937   text937     17.6522222
## 938   text938      9.9100000
## 939   text939     11.5474074
## 940   text940     12.4077778
## 941   text941     14.4445455
## 942   text942     10.1545455
## 943   text943      7.6000000
## 944   text944     14.6300000
## 945   text945     13.9876923
## 946   text946      6.9363636
## 947   text947     12.8600000
## 948   text948     10.1400000
## 949   text949      8.1400000
## 950   text950      8.3698077
## 951   text951     10.4346154
## 952   text952      7.9957895
## 953   text953      4.9107692
## 954   text954      6.0100000
## 955   text955     14.9875000
## 956   text956      3.6601852
## 957   text957     12.8354545
## 958   text958      9.1400000
## 959   text959      8.8977778
## 960   text960     11.4700000
## 961   text961      6.2755556
## 962   text962     12.8600000
## 963   text963     17.6636364
## 964   text964      9.0818182
## 965   text965      7.7733333
## 966   text966      5.8254348
## 967   text967     14.6466667
## 968   text968     11.8635294
## 969   text969      7.6338462
## 970   text970     13.2950000
## 971   text971     11.4700000
## 972   text972      2.4833333
## 973   text973      6.1375000
## 974   text974      9.0875000
## 975   text975      4.8434783
## 976   text976      4.5205556
## 977   text977      4.8434783
## 978   text978      6.4004545
## 979   text979     10.4728947
## 980   text980      4.8300000
## 981   text981      7.3463158
## 982   text982      4.9107692
## 983   text983      5.8066667
## 984   text984     11.0900000
## 985   text985     11.7842857
## 986   text986      7.1966667
## 987   text987      6.6860714
## 988   text988     10.8234783
## 989   text989      8.0090909
## 990   text990     13.9017391
## 991   text991     14.7989474
## 992   text992      2.1876923
## 993   text993     11.7627273
## 994   text994      6.9142308
## 995   text995      9.5500000
## 996   text996     10.0190476
## 997   text997      4.2545455
## 998   text998      6.2150000
## 999   text999      9.7855556
## 1000 text1000      9.5033333
## merge
topFRE <- cbind(top_1000, FRE_top)

## plot
plot_FRE <- ggplot(topFRE, aes(x = pledged, y=Flesch.Kincaid, color = pledged, size = pledged)) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(method='lm', color = 'red', se = F) + 
  ggthemes::theme_tufte() +
  labs(
    x = "Amount Pledged",
    y = "Flesch Kincaid Readability Score",
    title = "Relationship Between Amount Pledged and Readability of Text for Successful Projects")
ggplotly(plot_FRE)
## `geom_smooth()` using formula 'y ~ x'

In measuring project success by amount pledged, we can see from the plot that there is not much of a clear relationship between readability score and amount pledged. However, the data point that has the largest amount pledged does have a higher readability score.

3. Sentiment

Now, let’s check whether the use of positive / negative words or specific emotions helps a project to be successful.

a) Stay positive

Calculate the tone of each text based on the positive and negative words that are being used. You can rely on the Hu & Liu dictionary provided in lecture or use the Bing dictionary contained in the tidytext package (tidytext::sentiments). Visualize the relationship between tone of the document and success. Briefly comment.

## load sentiment dictionary
pos <- read.table('/Users/nikkigerjarusak/Documents/github/course_content/Lectures/Week09/data/dictionaries/positive-words.txt', as.is=T)
neg <- read.table('/Users/nikkigerjarusak//Documents/github/course_content/Lectures/Week09/data/dictionaries/negative-words.txt', as.is=T)

sentiment <- function(words=c("really great good stuff bad")){
  require(quanteda)
  tok <- quanteda::tokens(words)
  pos.count <- sum(tok[[1]]%in%pos[,1])
  neg.count <- sum(tok[[1]]%in%neg[,1])
  out <- (pos.count - neg.count)/(pos.count+neg.count)
}
set.seed(1234)

## get random sample of 2000
random_df <- sample_n(df, 2000)

## clean random sample 
random_df$blurb <- gsub("&amp", " ", random_df$blurb)
random_df$blurb <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", " ", random_df$blurb)
random_df$blurb <- gsub("@\\w+", " ", random_df$blurb)
random_df$blurb <- gsub("[[:punct:]]", " ", random_df$blurb)
random_df$blurb <- gsub("[[:digit:]]", " ", random_df$blurb)
random_df$blurb <- gsub("http\\w+", " ", random_df$blurb)
random_df$blurb <- gsub("[ \t]{2,}", " ", random_df$blurb)
random_df$blurb <- gsub("ˆ\\s+|\\s+$", " ", random_df$blurb)
random_df$blurb <- gsub("\n", " ", random_df$blurb)
random_cleaned <- random_df$blurb

random_sent <- data.frame(mapply(sentiment, random_cleaned))
random_sent <- data.frame(random_sent)
random_sent[is.na(random_sent)] <- 0
colnames(random_sent) <- c('Sentiment')
random_sent <- random_sent %>% 
  mutate(rn = row_number())

random_df <- random_df %>% 
  mutate(rn = row_number())
random_data <- merge(random_sent, random_df, by= "rn")

## sentiment vs amount pledged plot
sent_plot <- ggplot(data = random_data, aes(x = Sentiment, y = pledged)) + 
  geom_smooth(method= "loess",color= "black") +  labs(
    x = "Sentiment Score",
    y = "Amount Pledged",
    title = "Text Sentiment Score vs. Amount Pledged")
sent_plot
## `geom_smooth()` using formula 'y ~ x'

Documents with higher sentiment scores (more positive) tend to get a larger amount pledged to the project. However, we can see a dip at the end of the graph as documents that are too positive don’t achieve a higher amount pledged. We can see that there is a sweet spot of a sentiment score betweem .5 and 1.0 that receive the largest amount pledged.

b) Positive vs negative

Segregate all 2,000 blurbs into positive and negative texts based on their polarity score calculated in step (a). Now, collapse the positive and negative texts into two larger documents. Create a document-term-matrix based on this collapsed set of two documents. Generate a comparison cloud showing the most-frequent positive and negative words.

set.seed(1234)
## random 2000 sample
random_sample <- sample_n(df, 2000)
random_text <- random_sample$blurb

## clean using tidytext
random_text  <- tolower(random_text)
random_text <- removePunctuation(random_text)
random_text<- removeNumbers(random_text)
random_text <- stripWhitespace(random_text)
random_text <- removeWords(random_text, stopwords("en"))
random_text <- lemmatize_words(random_text)

## apply sentiment function
text_sentiment <- data.frame(mapply(sentiment, random_text))

## merge 
final_sent <- cbind(random_sample, text_sentiment)
final_sentiment <- final_sent$blurb 
final_sent_corp <- VCorpus(VectorSource(final_sentiment))

final_sent_corp <- tm_map(final_sent_corp,
    content_transformer(function(x) iconv(x, to='UTF-8-MAC', sub='byte')))
final_sent_corp <- tm_map(final_sent_corp, PlainTextDocument)
final_sent_corp  <- tm_map(final_sent_corp, removeWords, stopwords("english"))
final_sent_corp <- tm_map(final_sent_corp, removePunctuation)
final_sent_corp  <- tm_map(final_sent_corp, removeNumbers)
final_sent_corp  <- tm_map(final_sent_corp, content_transformer(tolower))

final_sent_dtm <- DocumentTermMatrix(final_sent_corp)
final_sent_dtm  <- tidy(final_sent_dtm)

sent_clean <- final_sent_dtm  %>%
  mutate(clean_word = lemmatize_words(term))

bing <- get_sentiments('bing')
final_sent_clean <-merge(sent_clean, bing,by.x='clean_word',by.y='word')


words <- final_sent_clean  %>%
  group_by(clean_word)%>%
  summarize(count=n(), sentiment=first(sentiment)) %>%
  arrange(count)

## comparison word cloud
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
matrix<- acast(words, clean_word~sentiment, value.var='count', fill=0)
comparison.cloud(matrix, colors=c('blue', 'pink'))
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): romantic could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): success could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): wonderful could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): forsake could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): freeze could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): impractical
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): knock could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): miss could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): nightmare could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): numb could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): obstacle could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): outbreak could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): problem could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): resistance
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): revenge could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): shake could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): shock could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): smelly could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): smoke could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): split could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): threat could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): tragedy could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): trap could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): trick could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): unemployed
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): unknown could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): unleash could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): unravel could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): unstable could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): vengeance could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): weird could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): appreciate
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): convenient
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): delight could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): empathy could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): enhance could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): guarantee could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): illuminate
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): magical could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): passionate
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): patient could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): powerful could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): precious could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): premier could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): reclaim could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): attraction
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): authentic could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): captivate could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): comfortable
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): compassion
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): dynamic could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): fortune could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): gorgeous could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): hopeful could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): inspiration
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): inspirational
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): legendary could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): luxury could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): promise could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): proper could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): protect could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): proud could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): secure could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): smooth could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): soulful could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): spiritual could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): strong could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): stylish could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): survivor could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): thank could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): wonder could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): angry could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): daunt could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): misfit could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): mope could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): slowly could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): unable could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): unfaithful
## could not be fit on page. It will not be plotted.

citation: referred to code found here: https://gist.github.com/rer145/8f31af53a9a8339dddbef93fd10e86ce

c) Get in their mind

Now, use the NRC Word-Emotion Association Lexicon in the tidytext package to identify a larger set of emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust). Again, visualize the relationship between the use of words from these categories and success. What is your finding?

nrc <-  get_sentiments("nrc")
unique(nrc$sentiment)
##  [1] "trust"        "fear"         "negative"     "sadness"      "anger"       
##  [6] "surprise"     "positive"     "disgust"      "joy"          "anticipation"
for (i in c("trust","fear","negative","sadness", "anger", "surprise","positive","disgust","joy","anticipation"))
{
assign(paste('nrc', '_', i, sep=""), dplyr::filter(nrc, sentiment==i))
}
## create NRC df
nrc_text <- data.frame(doc_id = topFRE$document, pledged = topFRE$pledged
                       , text= topFRE$blurb)

## assign emotions
for (i in 1:1000)
{
  indiv_text = nrc_text$text[i]
  tok <- quanteda::tokens(indiv_text)
  for (x in c("trust","fear","negative","sadness", "anger", "surprise","positive","disgust","joy","anticipation"))
  {
    result = sum(tok[[1]]%in%unlist(subset(nrc, sentiment==x, select=word)))
    nrc_text[i,x] = result
  }
}
## create column with emotions
colnms=c("trust","fear","negative","sadness", "anger", "surprise","positive","disgust","joy","anticipation")

for (emolex in c("trust","fear","negative","sadness", "anger", "surprise","positive","disgust","joy","anticipation"))
{
print(ggplot2::ggplot(data = nrc_text, aes( x = nrc_text[,emolex], y = pledged)) + 
  geom_col() + 
    scale_y_continuous(labels = scales::comma) + 
  labs(
    x = stringr::str_to_title(emolex),
    y = "Amount Pledged",
    title = paste(stringr::str_to_title(emolex), "Sentiment/Emotion")
  ) + 
   ggthemes::theme_tufte()) 
}

In the bar plots above, the data set only contains top 1000 successful projects. Projects with words that have positive association tend to have more pledged amounts, but like our sentiment analysis findings, blurbs that are too positive have diminishing returns. Emotions that are associated with positive sentiment such as trust, joy and anticipation all have more similar relationships with amount pledged (measure of success). Another important factor to note is how across all emotions/sentiments, the “0” value is dominant. This represents the limitations to using NRC Emolex dictionary, since it is also taking into consideration the corpuses that do not contain any matches from the dictionary.

## random_sent <- random_sent %>% 
 ##  mutate(rn = row_number())

## nrc_sent <- final_sent_dtm %>%
  ## inner_join(get_sentiments('nrc'), by='word')


## nrc_sent_n <- nrc_sent %>%
  ## group_by(sentiment)%>%
  ## tally %>%
  ## arrange(desc(n))

Submission

Please add the hash of your final commit in the feedback pull request to submit your homework. The homework is due on Monday, April 4 at 5pm.

Please stay honest!

If you do come across something online that provides part of the analysis / code etc., please no wholesale copying of other ideas. We are trying to evaluate your abilities to visualized data not the ability to do internet searches. Also, this is an individually assigned exercise – please keep your solution to yourself.